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Abstract 



We report on simultaneous recordings from cells in all layers of visual cortex 
and models developed to capture the higher order structure of population spiking 
activity. Specifically, we use Ising, Restricted Boltzmann Machine (RBM) and 
semi-Restricted Boltzmann Machine (sRBM) models to reveal laminar patterns 
of activity. While the Ising model describes only pairwise couplings, the RBM 
and sRBM capture higher-order dependencies using hidden units. Applied to 32- 
channel polytrode data recorded from cat visual cortex, the higher-order mod- 
els discover functional connectivity preferentially linking groups of cells within 
a cortical layer. Both RBM and sRBM models outperform Ising models in log- 
likelihood. Additionally, we train all three models on spatiotemporal sequences 
of states, exposing temporal structure and allowing us to predict spiking from 
network history. This demonstrates the importance of modeling higher order in- 
teractions across space and time when characterizing activity in cortical networks. 

1 Introduction 

Electrophysiology is rapidly moving towards high density recording techniques capable of capturing 
the simultaneous activity of large populations of neurons. This raises the challenge of understanding 
how networks encode and process information in ways that go beyond feedforward receptive field 
models for individual neurons. Modeling the distribution of states in a network provides a way to 
discover communication patterns and functional connectivity between cells. 

The Ising model, originally developed in the 1920s to describe magnetic interactions |T|, has been 
used to describe electrophysiological data, particularly in the retina |2|, and more recently for corti- 
cal recordings |3 |. This model treats spikes from a population of neurons binned in time as binary 
vectors and captures dependencies between cells with the maximum entropy distribution for pair- 
wise correlation. However, this only provides a good approximation for small groups of cells in the 
retina i4j| . 

In this work, we apply maximum entropy models to data from the visual cortex. Cortical networks 
have proven to be more challenging to model than the retina: even the existence of significant 
pairwise correlations between cortical cells is controversial 1 5 , 6 1 and higher order correlations play 
an important role |[7l[8][9l. One of the challenges with current recording technologies is that we 
cannot simultaneously record from more than a tiny fraction of the cells that make up a cortical 
circuit. Sparse sampling together with the complexity of the circuit mean that the majority of a 
cell's input will be from cells that are not part of the recording. In adult cat visual cortex, direct 
synaptic connections have been reported to occur between 11% - 30% of nearby pairs of excitatory 
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neurons in layer IV ifTOl , while a larger fraction of cell pairs show "polysynaptic" couplings ifTTIl . 
defined by a broad peak in the cross-correlation between two cells. This type of coupling can be 
due to common inputs (either from a different cortical area or in the form of lateral connections) or 
a series of monosynaptic connections. A combination of these is believed to give rise to most of the 
statistical interactions between the recorded cells. The Ising model, which assumes only pairwise 
couplings, is well suited to model direct synaptic coupling, but cannot deal with interactions that 
include many cells simultaneously. Therefore, we propose a new approach, that addresses both 
incomplete sampling and tentative inputs from larger scale cell assemblies. We extend the Ising 
model with a layer of hidden units or latent variables. The resulting model is a semi-Restricted 
Boltzmann Machine (sRBM), which combines pairwise connections between visible units with an 
additional set of connections to hidden units. 

Estimating the parameters of Ising models and Boltzmann machines is hard because probabilities 
are only defined up to a normalization constant. For both Ising models and Boltzmann machines 
with hidden units, this normalization constant is computationally intractable, requiring a sum over 
the exponential number of states of the system. This makes exact maximum likelihood estima- 
tion impossible for all but the smallest systems and has previously necessitated approximate and/or 
computationally expensive estimation methods. In this work, we use Minimum Probability Flow 
(MPF 1 12, 13 1, in the context of neural decoding see |[T4ll ) to estimate parameters efficiently without 
computing the intractable partition function. This allows us to estimate Ising models on higher- 
dimensional data than is otherwise feasible, and to estimate the sRBM in a straightforward way. 
Similar to parameter estimation, model evaluation is also hampered by the fact that the learned mod- 
els are not normalized. To compute probabilities and compare the likelihood of different models, 
annealed importance sampling ifTSl was used to estimate the partition function. 

Combining these two methods for model estimation and evaluation, we show that the sRBM can 
capture the distribution of states in a cortical network of tens of cells recorded from cat visual cortex 
significantly better than a pairwise model. The higher order structure discovered by the model is 
spatially organized and specific to cortical layers, indicating that networks within individual layers 
of a microcolumn are the dominant source of correlations. Applied to spatiotemporal patterns of 
activity, the model captures temporal structure in addition to dependencies across different cells, 
allowing us to predict spiking activity based on the history of the network. 

2 Methods 

2.1 Recording and experimental procedures 

Data were recorded from anesthetized cat visual cortex in response to a custom set of full field 
natural movie stimuli. Movies of 8 to 30 minutes duration were captured at 300 frames per second 
and 512 X 384 pixel resolution with a Casio Fl camera. Care was taken to avoid scene changes and 
camera motion, so the spatiotemporal statistics of the movie correspond to those of natural scene 
motion. Movies were converted to grayscale and down-sampled to 150 Hz for presentation. The 
high frame rate was chosen to prevent cells phase locking to the frame rate, and scene changes were 
minimized to avoid evoked potentials due to sudden luminance changes. Fig[T]:) shows an example 
frame crop from one of the natural scene movies. 

Recordings were made with single shank 32 channel polytrodes (Fig. [T^) with a channel spacing 
of 50/im, spanning all the layers of visual cortex. Individual datasets had on the order of of 20 to 
40 simultaneously recorded neurons. The data was spike sorted offline using k-means clustering 
(KlustaKwik, 1 16|) with a manual cleanup step (MClust, ifTTl ). Unless noted otherwise, spikes were 
binned at 20ms where bins with a single spike (2.4% of bins) and multiple spikes (0.9% of bins) were 
both treated as spiking and the rest as non- spiking. The bin size was chosen to span the width of the 
central peak in cross-correlograms between pairs of cells such as shown in[T^), where the dashed 
vertical lines at ± 20ms envelope the central peak. An example of binned data for 23 isolated single 
units is shown in[T]3) with black marks indicating spiking. 

To register individual recording channels with cortical layers, recording locations were reconstructed 
from Nissl-stained histological sections, and current source density analysis in response to flashed 
full-field stimuli was used to infer the location of cortical layer IV on the polytrode 11 8J . In[TJ)) we 
use horizontal lines to indicate the upper and lower boundaries of layer IV. 
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Figure 1: (a) Recordings were made from columns of cat visual cortex. The 32 channel probe is 
shown to scale, (b) Example data from 23 cells, binned in 20ms windows, 2s of data. Columns 
of this matrix are the input to our algorithm. For the spatiotemporal version of the model, we 
concatenate several adjacent columns, (c) Example of one of the natural movie stimuli presented, 
showing a group of ducks, (d) Histogram of instantaneous correlations between pairs of cells, (e) 
Cross-correlation as a function of time lag between a pair of cells binned at 6.7ms shows a peak in 
the correlation about 20ms wide (dashed lines). This determines the window size for binning the 
data. 



The surgical methods are described in detail elsewhere 1 19 |. The protocol used in the experiments 
was approved by the Institutional Animal Care and Use Committee at Montana State University 
and conformed to the guidelines recommended in Preparation and Maintenance of Higher Mam- 
mals During Neuroscience Experiments, National Institutes of Health Publication 913207 (National 
Institutes of Heath, Bethesda, MD 1991). 

The models were estimated on a data set of 180, 000 data vectors corresponding to 60 minutes 
of recording time. The data were split into two subsets of 90, 000 data points: a training set for 
parameter estimation and a test set to compute cross- validated likelihoods. 

We also analyzed spatiotemporal patterns of data, which were created by concatenating consecutive 
state vectors. For the spatiotemporal experiments a bin width of 6.7 ms, corresponding to the frame 
rate of the stimulus, was used. This bin size is a compromise capturing more detailed structure in 
the data without leading to an undue increase in dimensionality and complexity. Up to 10 time bins 
were concatenated in order to discover spatiotemporal patterns and predict spiking given the history 
over the prior 67ms. These models were trained on 100, 000 samples and likelihoods were computed 
on a set of equal size. 

2.2 Model and Estimation 

The sRBM consists of a set of binary visible units x G {0, 1}^ corresponding to observed neurons 
in the data and a set of hidden units h G {0, 1}^ that capture higher order dependencies. Weights 
between visible units, corresponding to an Ising model or fully visible Boltzmann machine, capture 
pairwise correlations in the data. Weights between visible and hidden units, corresponding to an 
REM, learn to describe higher order structure. 

The Ising model with visible-visible coupling weights J G JZ^^^ and biases b G TZ^ has an 
energy function 

Ei(x) = -x^Jx-b^x, (1) 
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with associated probabihty distribution pi (x) = ^ exp [— ^i(x)], where the normahzation con- 
stant, or partition function, Zi = X]{x'} ^^P consists of a sum over all 2^ system states. 

The RBM with visible-hidden coupling weights W G Jl^^^ and hidden and visible biases G 
and hh e has an energy function 

£;r(x, h) = -x^Wh - b^x - b^h, (2) 

with associated probability distribution pR (x, h) = ^ exp [— £^r(x, h)]. Since the there are no 
connections between hidden units (hence "restricted" Boltzmann machine), the latent variables h 
can be analytically marginalized out of the distribution (see Appendix A) to obtain 

P{^) = J dhp (x, h) = ^ exp [-£;r(x)] , (3) 

where the energy for the marginalized distribution over x (sometimes referred to as the free energy 
in machine learning literature) is 

i?R(x) = -J2 log(l + 6'^'^+'"^-) - b^X, (4) 

i 

where are rows of the matrix W. The energy function for an sRBM combines the Ising model 
and RBM energy terms, 

(x, h) = -x^Jx - x^Wh - b^x - b^h. (5) 
As with the RBM, it is straightforward to marginalize over the hidden units for an sRBM, 

ps(x) = ^exp[-^s(x)], (6) 

Es(x) = -x^Jx - ^log(l + e-^^^+^^'O - b^x. (7) 

A hierarchical Markov Random Field based on the sRBM has previously been applied as a model for 
natural image patches |20|, with the parameters estimated using contrastive divergence (CD, |l2Tl|). 

Instead of CD, which is based on sampling, we train the models using Minimum Probability Flow 
(MPF, 1 12 1), a recently developed estimation method for energy based models. MPF works by 
minimizing the KL divergence between the data distribution and the distribution which results from 
moving slightly away from the data distribution towards the model distribution. This KL divergence 
will be uniquely zero in the case where the model distribution is identical to the data distribution. 
While CD is a stochastic heuristic for parameter update, MPF provides a deterministic and easy 
to evaluate objective function. Second order gradient methods can therefore be used to speed up 
optimization considerably. The MPF objective function 



^ = E E 5 (x, x') exp [1 [i?(x) - E{^')] 



(8) 



measures the flow of probability out of data states x into neighboring non-data states x^ where the 
connectivity function g (x, x') G {0, 1} identifies neighboring states, and V is the list of data states. 
We consider the case where the connectivity function g (x, x') is set to connect all states which differ 
by a single bit flip 



ofxx') = l ^ ^f(x,x') = l 

^ ^ ' ' 10 otherwise 



(9) 



where R (x, x') is the Hamming distance between x and x^ See Appendix Bplfor a derivation of the 
MPF objective function and gradients for the sRBM, RBM, and Ising models. In all experiments, 
minimization of K was performed with the MinFunc implementation of L-BFGS ll22ll . 



To prevent overfitting all models were estimated with an Li sparseness penalty on the coupling 
parameters. This was done by adding a term of the form A ^ • ^ | J^j | + A | W^/c | to the objective 



^Code is available for download at |http ://github. com/S ohl-Dickstein 
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function, summing over the absolute values of the elements of both the visible and hidden weight 
matrices. The optimal sparseness A was chosen by cross- validating the log-likelihood on a holdout 
set. 

Since MPF learning does not give an estimate of the partition function, we use annealed importance 
sampling (AIS, 1 15 1) to compute normalized probabilities. AIS is a sequential Monte Carlo method 
that works by gradually morphing a distribution with a known normalization constant (in our case a 
uniform distribution over x) into the distribution of interest. See Appendix C for more detail. AIS 
applied to RBM models is described in 1231 , which also highlights the shortcomings of previously 
used deterministic approximations. 

Normalizing the distribution via AIS allows us to compute the log likelihood of the model Pm and 
compare it to the likelihood gained over a baseline model. This baseline assumes cells to be inde- 
pendent and characterized by their firing rate Pr(x) = Hil^i^* + (1 — — ^i)) with rates 
for individual cells i. The independent model is easily estimated and normalized, and is commonly 
used as a reference for model comparison. The excess log likelihood over this baseline is defined in 
terms of an expectation per unit time as £ = Xlxer' [log2Pm(x) — log2Pr(x)], where r is the 
width of a time bin. The excess log likelihood rate £, measured in bits/s, is used as the basis for 
model comparisons. 

3 Results 

We estimated Ising, RBM and sRBM models for populations of cortical cells simultaneously 
recorded across all cortical layers in a microcolumn. Here we present data from two animals, one 
with 23 single units (B4), another with 36 units (T6), as well as a multiunit recording with 28 units 
(B4MU). In each case the population was verified to be visually responsive and the majority of cells 
were orientation selective simple or complex cells. 

The estimated model parameters for the three different types of models (Ising, RBM and sRBM) 
are shown in Fig. [2] The sparseness penalty, chosen to optimize likelihood on a validation dataset, 
results in many of the parameters being zero. For the Ising model (a) we show the coupling as a 
matrix plot, with horizontal lines indicating anatomical layer boundaries. The diagonal contains the 
bias terms (there is no self-coupling), which are negative since all cells are off the majority of the 
time. The matrix has many small positive weights that encourage positive pairwise correlations by 
lowering the energy of connected states being active simultaneously. 

In (b) we show the hidden units of the RBM as individual bar plots, with the bars representing con- 
nection strengths to visible units. The topmost bar corresponds to the hidden bias of the unit. Hidden 
units are ordered by variance. The units are highly selective in connectivity: The first unit almost 
exclusively connects to cells in the deep (granular and subgranular) cortical layers. The second and 
third unit capture correlations between cells in the superficial (supergranular) layers. The correla- 
tions are of high order, with 10 and more cells receiving input from a hidden unit. The remaining 
units connect fewer cells, but still tend to be location- specific. Unit five captures a predominantly 
pairwise correlation that is also visible in the Ising model coupling matrix. Only six out of the total 
23 hidden units have non-zero couplings and are shown. Additional hidden units are disabled by 
the LI sparseness penalty, which was chosen to maximize likelihood on the cross-validation dataset. 
The interpretation of hidden units is quite similar to the pairwise terms of the Ising model: positive 
coupling to a group of visible units encourages these units to become active simultaneously, as the 
energy of the system is lowered if both the hidden unit and the cells it connects to are active. Thus 
the hidden units become switched on when cells they connect to are firing (activation of hidden units 
not shown). 

The sRBM combines both pairwise and hidden connections and hence is show with a pairwise 
coupling matrix and bar plots for hidden units. Due to the larger number of parameters, the optimal 
model is even more sparse. The remaining pairwise terms now predominantly encode negative 
interactions, whereas much of the positive coupling has been explained away by the hidden units. 
These still provide strong positive couplings within either superficial (II/III) or intermediate (IV) 
and deep (VA^I) layers, which explain the majority of structure in the data. 

It is noteworthy that the preferred explanation for dependencies between recorded neurons is via 
connections to shared hidden units, rather than direct couplings between visible units. The RBM 



5 




5 

c) sRBM pairwise couplings and hidden units 



Figure 2: Connectivity patters of the three models estimated for a recording session with 23 spike 
sorted single units. The horizontal lines indicate approximate boundaries between cortical layers 
II/III, layer IV and layers VA^I. (a) Ising model with the bias terms on the diagonal. The model has 
many small coupling terms that encode positive correlations, (b) The RBM coupling weights are 
plotted as a bar chart for each hidden unit, ordered by activity from left to right. The first bar chart 
is the bias for all the visible units, and the extra bar at the top of each plot corresponds to the bias 
of that hidden unit. Blue bars indicate a sign flip (the bias terms are predominantly negative, but 
plotted with reversed sign for easier comparison with the remaining terms), (c) The sRBM weights 
are shown in the same way, with the pairwise couplings on the left and hidden units on the right. 
The pairwise connections are qualitatively very different from those of the Ising model, as most of 
the structure is better captured by hidden units. 
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Figure 3: (a) Model comparison using likelihood gain over the independent (firing rate) model. 
Likelihoods are normalized to bits/s, but not for model size, explaining the large difference between 
data sets of different size: 23 cells for session B4, 28 cells for MU, and 36 cells for T6. B4 and T6 
are spike sorted, MU is a multiunit dataset. All three models outperform the independent model by 
10-20 bits/s. The higher order models with hidden units give a further improvement of about 2 bit/s 
over the Ising model. 
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a) Independent vs. Ising model b) Ising vs. RBM model 
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Figure 4: Scatter plot of test data set showing empirical probabilities against model probabilities. 
Different models are encoded by color, the number of simultaneously spiking cells in each pattern 
by different symbols, (a) shows the independent model compared to the Ising model, (b) shows the 
Ising model compared to the RBM. The sRBM is omitted as it is very similar to the RBM. The RBM 
significantly outperforms simpler models. 



and sRBM in this comparison were both estimated with 25 hidden units, but we show only units 
that did not go to zero due to the sparseness constraint. In this case, A = 2xl0~^ was found to be 
optimal. Since at this level of sparseness, many of the hidden units turn off entirely, it was deemed 
unnecessary to further increase the number of hidden units. 

For a quantitative comparison between models, we computed normalized likelihoods using AIS to 
estimate the partition function. For each model, we generated 500 samples through a chain of 10^ 
annealing steps. To ensure convergence of the chain, we use a series of chains varying the number 
of annealing steps and verify that the estimate of Z stabilizes to within at least 0.02 bits/bin (see 
supplemental Fig. [7] in the Appendix). For models of size 20 and smaller we furthermore computed 
the partition function exactly to compare against the AIS estimate. 

Fig. 3] a) shows a comparison of excess log likelihood C over the independent firing rate model 
for the three different models and on all three datasets. C is computed in units of bits/s for the full 
population. Both higher-order models outperform the Ising model in fitting the datasets, significantly 
so for two datasets. Error bars are standard error on the mean, computed from 10 models with 
different random subsets of the data used for learning and validation, and different random seeds 
for the parameters and the AIS sampling runs. Each of the models was estimated for a range of 
sparseness parameters A = [0, 1, 2, 4, 6, 8, 10] x 10~^ bracketing the optimal A, and the results are 
shown for the optimal choice of A for each model. 

Additional insight into the relative performance of the models can be gained by comparing model 
probabilities to empirical probabilities for the various types of patterns. Fig.|4]shows scatter plots of 
model probabilities under the different models against pattern frequencies in the data. Patterns with 
a single active cell, two simultaneously active cells, etc. are distinguished by different symbols. As 
expected from the positive correlations, the independent model (yellow) shown in a) has to greatly 
overestimate the probabilities of cells being active individually, so these patterns fall above the iden- 
tity line, while all other patterns are underestimated. For comparison the Ising model is shown in the 
same plot (blue), and does significantly better, indicated by the points moving closer to the identity 
line. It still tends to fail in a similar way though, with many of the triplet patterns being underesti- 
mated as the model cannot capture triplet correlations. In b), this model is directly compared against 
the RBM (green). Except for very rare patterns, most points are now very close to the identity line, 
as the model can fully capture higher order dependencies. Hidden units describe global dependen- 
cies that greatly increase the frequency of high order patterns compared to individually active cells. 
The 5% and 95% confidence intervals for the counting noise expected in the empirical frequency of 
system states are shown as dashed lines. The solid line is the identity. 
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Figure 5: Likelihood comparison as a function of model size, for spatiotemporal models with 10 
cells and a varying number of concatenated time steps. The log-likelihood per neuron increases 
as each neuron is modeled as part of a longer time sequence. This effect holds both for Ising and 
higher order models, but since the Ising model cannot capture many of the relevant dependencies, 
the increase in likelihood saturates at about 3 timesteps. 



Note that any error in estimating the partition function of the models would lead to a vertical offset 
of all points. Thus visually checking the alignment of the data cloud around the identity line pro- 
vides an intuitive verification that there are no catastrophic errors in the estimation of the partition 
function. Unfortunately we cannot use this alignment (e.g. of the most frequent all zeros state) as 
a shortcut to compute the partition function without sampling: Li regularization tends to reduce 
model probabilities of the most frequent states, so this estimate of Z would systematically overesti- 
mate the likelihood of regularized models. We note, however, that for models with no regularization 
this estimate does indeed agree well with the AIS estimate. 



3.1 Spatiotemporal models 

The same models can be used to capture spatiotemporal patterns by treating previous time steps as 
additional cells. Consecutive network states binned at 6.7ms are concatenated in blocks of up to 
10 time steps , for a total network dimensionality of 100 with 10 cells. These models were cross- 
validated and the sparseness parameters optimized in the same way as for the instantaneous model. 
This allows us to learn kernels that describe the temporal structure of interactions between cells. 

In Fig. [5] we compare the relative performance of spatiotemporal Ising and higher order models as 
a function of the number of time steps included in the model. To create the datasets, we picked 
a subset of 10 cells with the highest firing rates from the B4 dataset (4 cells from subgranular, 2 
from granuar and 4 from supergranular layers) and concatenated blocks of up to 10 subsequent data 
vectors. This way models of any dimensionality divisible by 10 can be estimated. The number of 
parameters of the RBM and Ising model were kept the same by fixing the number of hidden units 
in the RBM to be equal to the number of visible units, the sRBM was also estimated with a square 
weight matrix for the hidden layer. As before, the higher order models consistently outperform the 
Ising model. The likelihood per unit increases with the network size for all models, as additional 
information from network interactions leads to an improvement in the predictive power of the model. 
However, the curve levels off for the Ising model after a dimensionality of about 30 is reached, as 
higher order structure that is not well captured by the Ising model becomes increasingly important. 

A similar observation has been made in |4|, where Ising and higher order models for 100 retinal 
ganglion cells were compared to models for 10 time steps of 10 cells. It is noteworthy that temporal 
dependencies are similar to dependencies between different cells, in that there are strong higher order 
correlations not well described by pairwise couplings. These dependencies extend surprisingly far 
across time (at least 67ms, corresponding to the largest models estimated here) and are of such a 
form that including pairwise couplings to these states does not increase the likelihood of the model. 
This has implications e.g. for GLMs that are typically estimated with linear spike coupling kernels 
which will miss these interactions. 
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a) Subset of hidden unit filters 



b) Log-likelihood gain, cells sorted by spike rate 
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Figure 6: (a) Hidden units of spatiotemporal sRBM model. For each hidden unit, the horizontal 
axis is time and the vertical axis cells with horizontal bars separating the subgranular, granuar and 
supergranular cortical layers, (b) Log-likelihood gain for one cell conditioned on the network state 
for all three models. The remaining population carries more information about cells with higher 
firing rates, (c) Spike prediction from network history. For one of the cells, we show Is of predicted 
activity given the history of the network state. In each case when a spike occurs in the data, there is 
an elevated probability under the models. 



To predict spiking based on the network history, we can compute the conditional distribution of 
single units given the state of the rest of the network. This is illustrated for a network with 15 time 
steps for a dimensionality of 150. Fig.[6]a) shows the learned weights of 18 randomly chosen hidden 
units for a spatiotemporal RBM model with 150 hidden units. Each subplot corresponds to one 
hidden unit, which connects to 10 neurons (vertical axis) across 15 time steps or 100ms (horizontal 
axis). Some units specialize in spatial coupling across different cells at a constant time lag. The 
remaining units describe smooth, long-range temporal dependencies, typically for small groups of 
cells. Both of these subpopulations capture higher order structure connecting many neurons that 
cannot be approximated with pairwise couplings. 

By conditioning the probability of one cell at one time bin on the state of the remaining network, we 
can compute how much information about a cell is captured by the model over a naive prediction 
based on the firing rate of the cell. This conditional likelihood for each cell is plotted in|6]b) in a 
similar way to excess log likelihood for the entire population in Fig.|5] While the result here reflects 
our previous observation that Boltzmann machines with hidden units outperform Ising models, we 
note that the conditional probabilities are easily normalized in closed form since they describe a 
one-dimensional state space. Thus we can ensure that the likelihood gain holds independent of the 
estimation of Z and cannot be due to systematic errors in sampling from the high-dimensional mod- 
els. Fig.[6]c) provides a more intuitive look at the prediction. For Is of data from one cell, where 5 
spikes occur, we show the conditional firing probabilities for the 3 models given 100ms of history of 
itself and the other cells. Qualitatively, the models perform well in predicting spiking probabilities, 
suggesting it might compare favorably to prediction based on GLM-type or Ising models 12411 . 



9 



4 Discussion 



We explored the utility of Boltzmann machines with hidden units as models for neural population 
data, arguing that it provides a better model for cortical data than Ising models and previous exten- 
sions. While there has been a resurgence of interest in these maximum entropy models for describing 
neural data, progress has been hampered mainly by two problems: Estimation of energy based mod- 
els is difficult since these models cannot be normalized in closed form. Evaluating the likelihood 
of the model thus requires approximations or a numerical integral over the exponential number of 
states of the model, making maximum likelihood estimation computationally intractable. Therefore 
even the pairwise Ising model is hard to estimate in general, and various approximations have been 
used to overcome this problem. This is a purely computational difficulty, but there is a more funda- 
mental issue with generalizing models to include higher order dependencies. While this does not by 
itself make the model estimation any more difficult, in general the number of model parameters to 
be estimated is now also exponential in the size of the data. This can be dealt with either by cutting 
off dependencies at some low order, estimating only a small number of higher order coupling terms, 
or by imposing some specific form on the dependencies. 

We attempted to address both of these problems here. Parameter estimation was made tractable using 
MPF, and latent variables were shown to be an effective way of capturing high order dependencies. 
This addresses several shortcomings that have been identified with the Ising model. 

As Macke argues in 1251 . models with direct (pairwise) couplings are not well suited to model data 
recorded from cortical networks. Since only a tiny fraction of the neurons making up the circuit are 
recorded, most input is likely to be common input to many of the recorded cells rather than direct 
synapses between them. Whiles his work compares Generalized Linear Models (GLMs) such as 
models of the retina 1 26 1 and for LGN 1 27 1 to linear dynamical systems (LDS) models, the argument 
applies equally for the models presented here. 

Another shortcoming of the Ising model and some previous extensions is that the number of parame- 
ters to be estimated does not scale favorably with the dimensionality of the network. The number of 
pairwise coupling terms in GLM and Ising models scales with the square of the number of neurons, 
so with the amounts of data typically collected in electrophysiological experiments it is only possible 
to identify the parameters for small networks with a few tens of cells. This problem is aggravated by 
including higher order couplings: for example the number of third order coupling parameters scales 
with the cube of the data dimensionality. Therefore attempting to estimate these coupling parameters 
directly is a daunting task that usually requires approximations and strong regularization. 

An alternative is to focus on higher order structure in very small networks. Ohiorhenuan noted that 
Ising models fail to explain structure in cat visual cortex |7 1 and was able to model triplet correla- 
tions 1 8 1 by considering very small populations of no more than 6 neurons. However, Schneidman 
and Ganmor caution 131 that trying to model small subsets (10 cells) of a larger network to infer 
properties of the full network may lead to wrong conclusions and show that for retinal networks 
higher order correlations start to dominate only once a certain network size is reached. Therefore 
they address the same question as the present paper, i.e. how to capture n*^ order correlations with- 
out the accompanying growth in the number of free parameters in a larger network. In their 
proposed reliable interaction model, they exploit the sparseness of the neural firing patterns to ar- 
gue that most higher order coupling terms will be zero. Therefore the true distribution can be well 
approximated from a small number of these terms, which can be calculated using a simple recursive 
scheme. In practice, the main caveat is that only patterns that appear in the data many times are used 
to calculate the coupling terms. While the model by construction assigns correct relative probabil- 
ities to observed patterns, the probability assigned to unobserved patterns is unconstrained, and the 
most probable states may therefore be ones which never occur in the data. 

In contrast to these models focus sed on retinal data however, in visual cortex higher order corre- 
lations play an important role even in small networks. Yu et al. |3, 9 | show that over the scale of 
adjacent cortical columns of anesthetized cat visual cortex, small subnetworks of 10 cells are better 
characterized with a dichotomized Gaussian model than the pairwise maximum entropy distribution. 
While the dichotomized Gaussian 1281 is estimated only from pairwise statistics, it carries higher 
order correlations that can be interpreted as common Gaussian inputs |29|. However these correla- 
tions are implicit in the structure of the model and not directly estimated from the data as with the 
RBM, so it is not clear that the model would perform as well on different datasets. 
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Finally, GLMs f26l can be used to model each cell conditioned on the rest of the population. While 
mostly used for stimulus response models including stimulus terms, they are easily extended with 
terms for cross-spike coupling, which capture interactions between cells. A major limitation of 
GLMs is that current implementations can only be estimated efficiently if they are linear in the 
stimulus features and network coupling terms, so they are not easily generalized to higher order 
interactions. Two approaches have been used to overcome this limitation for stimulus terms. The 
GLM can be extended with additional nonlinearities, preserving convexity on subproblems (271. 
Alternatively, the stimulus terms can be packaged into nonlinear features which are computed in 
preprocessing and usually come with the penalty of a large increase in the dimensionality of the 
problem 1 30 1 . However, we are not aware of any work applying either of these ideas to spike history 
rather than stimulus terms. Another noteworthy drawback of GLMs is that instantaneous coupling 
terms cannot be included as this causes probabilities to diverge f25l, so instantaneous correlations 
cannot be modeled and have to be approximated using very fine temporal discretization. 

In conclusion, the RBM provides a parsimonious model for higher order dependencies in neural 
population data. Without explicitly enumerating a potentially exponential number of coupling terms 
or being constrained by only measurements of pairwise correlations, it provides a low-dimensional, 
physiologically interpretable model that can be easily estimated for populations of 100 and more 
cells. 

The connectivity patterns the RBM learns from cells simultaneously recorded from all cortical lay- 
ers are spatially localized, showing that small neural assemblies within cortical layers are strongly 
coupled. This suggests a shared computation these local networks perform on common input, while 
cells across different cortical layers participate in distinct computations and have much less coupled 
activity. This novel observation is made possible by the RBM: because each of the hidden units re- 
sponds to (and therefore learns on) a large number of recorded patterns, it can capture dependencies 
that are too weak to extract with previous models. In particular, the connectivity patterns discovered 
by the RBM and sRBM are by no means obvious from the covariance of the data or by inspecting 
the coupling matrix of the Ising model. This approach, combining a straightforward estimation pro- 
cedure and a powerful model, can be extended from polytrode recordings to capture physiologically 
meaningful connectivity patterns in other types of multi-electrode data. 
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A IVIarginalizing over hidden units 

In this section we review how the joint probability for visible and hidden units in the RBM can be 
integrated in closed form to obtain the marginal distribution for the visible units. First we rewrite 
the sum over hidden variables as a product 



^>(x, h) = ^ exp j ^ XihjWij + ^ Xiai + ^ hjbj | 
= —e^i ^^"^ n I XihjWij + hjbj J 



(10) 



(11) 



and integrate p{x.) = h)' where the sum is over all configurations of the hidden units. 

For the purpose of this derivation, we move the first term outside the product 



We can write the sum over all hidden states as nested sums over every state for each hidden unit 



p(x) = ^e^^ ^^^^ X] ( X] XihiWii + hibi ) H exp | ^ XihjWij + hjbj j . (12) 

{h} \ i J j=2 \ i J 

m states as nested sums over every state for each 

^ ... ^ exp I ^XihiWii + hibi J 
2iG{o,i} hme{o,i} \ i I 

M / \ 

JJ exp I ^ XihjWij + hjbj j . 

3=2 \ i I 



(14) 



Noting that the term we have singled out appears only in one of the sums, we rearrange to isolate 
the sum. 



p(x) =^e^^ '''''' X] ^^P ( X] XihxWix + /ii&i j 

- n ^""p ^i^j^^i + h^3 • 

/i2G{0,l} /i^G{0,l}j=2 V i ) 

Doing this for all terms we obtain a product over individual one-dimensional sums 

1 ^ / 

p(x) = -e^i ^^^^ W exp ^ XihjWij 

i=l/ijG{0,l} \ i 

1 + exp I ^ XiWij + bj 



1 ^ 



^ exp { ^ Xiai ^ log 1 + exp XiWij ~l~ ^jf I I 



(15) 
(16) 

(17) 
(18) 
(19) 
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The marginal distribution over x for the RBM is now in the form of a standard energy based model, 
with energy function 

M 

■ ^ Xiai - ^ log 



^(x) 



1 + exp 



XiWij + hj 



(20) 



The sRBM follows the same logic, since additional terms in the energy that do not depend on the 
hidden units stay outside the sum over hidden configurations in the same fashion as for the visible 
bias. 



B MPF objective 
B.l Ising model 

Here we review the derivation of the MPF objective for an Ising model, where the objective function 
consists of terms connecting the data states to all states which differ by a single bit flip. 

The general MPF objective function is given by 

where g (x, x') = g (x', x) G {0, 1} is the connectivity function, £^(x; 0) is an energy function 
parameterized by 0, and V is the list of data states. We consider the case where the connectivity 
function g (x, x') is set to connect all states which differ by a single bit flip, 

^ _ / 1 X and ^' differ by a single bit flip, Y.n - ^nl = 1 ni\ 

^ix,x; - I Q otherwise ^^^^ 

The MPF objective function in this case is 

^ /I \ 

^ (®) = E E ( 2 1^^'"' ®) " ^^"^ + ^^""^ ) ^^^^ 

where the sum over n is a sum over all data dimensions, and the bit flipping function d(x, n) G 

{-1,0,1}^ is 



d(x,n)i = 

For the Ising model, the energy function is 



i^n 
-{2xi — 1) i = n 



E = x^ Jx 



(24) 



(25) 



where xg{0,1}^,Jg TZ^^^, and J is symmetric (J = J^). The bias terms have been absorbed 
into the diagonal of the matrix J which is possible since x'^ = x holds for binary x. 



Substituting this energy into the MPF objective function, it becomes 

i [x^ Jx - (x + d(x, n))^ J(x + d(x, n))] ^ 

^ [x^Jx - (x^Jx + 2x^Jd(x, n) + d(x, n)^Jd(x, n))] j 



^ = EE-p 



EE^^P 
EE^^P 
EE^^P 

xeX> n 



[2x^Jd(x, n) + d(x, n)^ Jd(x, n)] 



2 ^ ^ ^iJin (1 '^^n) ~l~ Jnn 
i 

(2>Xyi 1) ^ ^ XiJiji ~Jnn j • 



(26) 
(27) 
(28) 

(29) 

(30) 
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Assume the symmetry constraint on J is enforced by writing it in terms of a another possibly asym- 
metric matrix J' e 7e^^^, 



J- -3' + -3'^ 
The derivative of the MPF objective function with respect to J Ms 
dK 



(31) 



- ^ exp 

i^expj {2xi-l)'^XiJii-^Ju 



{2Xm -l)xi- Sim 
{2xi -l)Xm- Sml 



(32) 



where the second term is simply the first term with indices / and m reversed. 
B.2 RBM 

After marginalizing out the hidden units, the energy function over the visible units for an RBM is 
given by: 



E(x) = -^log(l + e-^^" 



(33) 



where Wi is a vector of coupling parameters and x is the binary input vector. Bias terms have been 
omitted for readability. 



As previously, we substitute into the objective function Eq. [23] to obtain 

= ^ ^ exp [- ^ log (1 + e-^'-) + Y: log (l + e-^^+^'^^^e^-))] ) . 



(34) 



Unlike for the Ising model there is no cancellation of data and non-data energy terms, so evaluating 
the function and derivative requires looping over all bit flips for the data set. 

B.3 sRBM 

The energy function over the visible units for an sRBM obtained by marginalizing out the condi- 
tionally independent hidden units is 



^(x; J, W) = x^Jx - ^ log(l + e-^^^") 



(35) 



where x G {0, 1} is the visible state, J = G Jl^^^ is a symmetric coupling matrix, and 
W G T^^x^ is a weight matrix to M hidden units. Equation 35 consists of a term capturing 
connections between visible units (an Ising model), and a term capturing connections to hidden 
units (an RBM). 

The MPF objective we use again consists of energy differences between data and non-data states 
differing by a single bit. For the RBM this energy difference with the n^^ bit flipped is 



[log(l + e-^?^^) - log(l + e-^?^^^+^^^'^^^) 

i 

J2H{l + e'')-log{e''+e^'-'-)] 



(36) 
(37) 



where for notational simplicity we have defined Zi = wj^x and 6 = 2x — 1. The energy difference 
contributed by connections between visible units (the Ising model) is 



(38) 
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where we define the shorthand y = Jx for simpHcity. The total objective function is then given by 
a sum over samples and bit flips as 

^1 



K= ^^exp 



AdEl+dEl) 



To compute the gradient of this objective w.r.t. the parameters W and J we not that 



these terms are computed as 



-^dE^ - (2b y --J ^ -2b X -- 



for the pairwise terms, and 

d 



=-E- 

2 ^ 1 



1 V- 



2 e^- + e^--^- 



j 



for the higher order terms. 



(39) 

(40) 
(41) 

(42) 

(43) 
(44) 
(45) 
(46) 



C Annealed importance sampling 



Estimating the normalization constant, also referred to as partition function, of an energy based 
probabilistic model, remains a challenging task |23|. Many approaches to make learning the pa- 
rameters of energy based models tractable, such as Contrastive Divergence, MPF, Score and Ratio 
Matching |3T] |32l do not attempt to estimate the partition function. A notable exception is Noise 
Contrastive Estimation fSSl which treats the partition function as a parameter to be estimated, but has 
only been applied for continuous-valued data. Most commonly the partition function is estimated 
by sampling. 

Using importance sampling, the partition function can be estimated by 

where Zq is the known partition function of the proposal distribution g(x), Zp is the partition func- 
tion of interest for p(x) and the symbol ~ indicates a non-normalized distribution. The angle brack- 
ets indicate a sample expectation over samples from the distribution p(x). However, if g'(x) is not a 
good match to the target distribution, it takes a very large number of samples to get a good estimate. 
AIS uses an annealing process to gradually transform a simple proposal distribution, such as the 
uniform distribution, into the target distribution, leading to an accurate estimate of Z from only a 
small number of samples. 

To assure convergence of the estimator, we run several annealing chains, increasing the number of 
steps in factors of 2 up to a size of 10^ steps. We check that the final estimate of log2(^) does 
not deviate more than 0.02 from the previous estimates. This criterium was chosen since log2(^) 
appears as an additive term to £, and at a bin size r = 50ms an error of 1 bits / second in the 
final estimate of the likelihood was seen as an acceptable trade-off between estimation speed and 
accuracy. In Fig. [7] we show this convergence plot for a small 20-dimensional model, where the 
normalization constant was computed exactly. For larger models, where the partition function could 
not be calculated analytically, we monitored that the estimate stabilized to within this tolerance. 
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Figure 7: Monitoring the convergence of the AIS estimate for the partition function, here for a 
20-dimensional Ising model. Each entry on the horizontal axis corresponds to an annealing chain 
with a different number of steps. Points correspond to the 500 individual samples, the blue line 
is the log2 of the average from the samples. The solid green line is the true value of the partition 
function computed numerically by summing over the 2^^ states. The dashed lines correspond to our 
convergence criterion of 0.02 deviation from the true partition function. 
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